Featurization, Model Selection & Tuning-Module Project


A complex modern semiconductor manufacturing process is normally under constant surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information, and noise. Engineers typically have a much larger number of signals than are actually required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. The process engineers may then use these signals to determine key factors contributing to yield excursions downstream in the process. This enables increased process throughput, decreased time to learning, and reduced per-unit production costs. These signals can be used as features to predict the yield type, and by analysing different combinations of features, the essential signals that impact the yield type can be identified.

The data consists of 1567 examples each with 591 features.
The dataset presented in this case represents a selection of such features, where each example represents a single production entity with its associated measured features, and the labels represent a simple pass/fail yield for in-house line testing. In the target column, “–1” corresponds to a pass and “1” corresponds to a fail, and the timestamp is for that specific test point.

Key facts: Data structure: SECOM data consisting of 1567 examples, each with 591 features (a 1567 × 591 matrix), and a labels file containing the classification and date-time stamp for each example.
As with any real-life data, this dataset contains null values of varying intensity depending on the individual features. This needs to be taken into consideration when investigating the data, either through pre-processing or within the technique applied.

The data is represented in a raw text file, each line representing an individual example with the features separated by spaces. The null values are represented by the 'NaN' value, as in MATLAB.
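Given that format, the raw file can be read directly with pandas, which treats 'NaN' tokens as missing values by default. The inline sample below is hypothetical, standing in for the real SECOM file:

```python
import io

import pandas as pd

# Hypothetical two-line sample mimicking the SECOM raw text format:
# one example per line, features separated by spaces, 'NaN' for nulls.
raw = "3030.93 2564.0 NaN 1.3\n3095.78 NaN 2465.14 1.1\n"

# pandas parses the 'NaN' tokens as missing values by default
df = pd.read_csv(io.StringIO(raw), sep=" ", header=None)
print(df.shape)               # (2, 4)
print(df.isna().sum().sum())  # 2 missing values
```

For the actual dataset, `io.StringIO(raw)` would be replaced by the path to the SECOM text file.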

We will build a classifier to predict the pass/fail yield of a particular process entity and analyse whether all the features are required to build the model.


1. Import and explore the data.

2. Data cleansing:

    • Missing value treatment.

    • Drop attribute/s if required using relevant functional knowledge.

    • Make all relevant modifications on the data using both functional/logical reasoning/assumptions.

Check for columns with near-zero variance. For this, we can count the number of unique values in each column; if a column has only one (or very few) unique values, we can delete it, as it holds no information or predictive power for the models to use.
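The unique-value check above can be sketched as follows, on a hypothetical toy frame standing in for the SECOM features:

```python
import pandas as pd

# Hypothetical toy frame standing in for the SECOM feature matrix
df = pd.DataFrame({
    "f1": [1.0, 1.0, 1.0, 1.0],   # constant -> no predictive power
    "f2": [0.1, 0.2, 0.3, 0.4],
    "f3": [5.0, 5.0, 5.0, 6.0],
})

# count unique non-null values per column; drop columns with a single value
n_unique = df.nunique(dropna=True)
constant_cols = n_unique[n_unique <= 1].index.tolist()
df = df.drop(columns=constant_cols)
print(constant_cols)  # ['f1']
```

The `<= 1` threshold can be raised to also catch columns with "very few" unique values.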

  In order to deal with missing values, we check the missing values by observation and by variable. The figure above shows that most variables have no more than 17.42% missing values. However, high percentages of missing values do occur in some variables, ranging from 45.62% to 91.13%. Thus, we drop the variables that contain more than 40% missing values and impute the remaining ones.
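The per-variable missing percentage and the 40% drop rule can be expressed as below; the two-column frame is a hypothetical stand-in:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'a' is 75% missing (dropped), 'b' is 25% (kept)
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0],
})

# percentage of missing values per variable
missing_pct = df.isna().mean() * 100

# drop variables exceeding the 40% threshold; the rest get imputed later
to_drop = missing_pct[missing_pct > 40].index.tolist()
df = df.drop(columns=to_drop)
print(to_drop)  # ['a']
```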

     There are several methods for missing-value imputation, such as k-nearest neighbours, regression, and random forests. K-nearest neighbours (KNN) is a natural improvement over mean imputation that exploits the observed data structure. Multivariate Imputation by Chained Equations (MICE) is based on a much more complex algorithm, and its behaviour appears to be related to the size of the dataset; it becomes time-intensive when applied to large datasets. So we use KNN to impute the missing values, as it is a relatively efficient algorithm for high-dimensional data.
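KNN imputation is available in scikit-learn as `KNNImputer`; a minimal sketch on a tiny hypothetical matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical matrix with one missing entry
X = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [3.0, 4.0],
])

# each missing entry is filled with the mean of that feature over the
# k nearest neighbours (distance computed on the observed features)
imputer = KNNImputer(n_neighbors=2)
X_imp = imputer.fit_transform(X)
print(X_imp[1, 0])  # mean of the two neighbours' first feature: 2.0
```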

3. Data analysis & visualization:

    • Perform detailed relevant statistical analysis on the data.

    • Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.

We can see that most of the features have a mean approximately equal to the median, implying that these features follow a normal/Gaussian distribution, which is helpful.

The SECOM (Semiconductor Manufacturing) dataset consists of manufacturing-operation data and semiconductor-quality data. It contains 1567 observations taken from a wafer fabrication production line. Each observation is a vector of 590 sensor measurements plus a pass/fail test label. There are only 104 fail cases, labelled as positive (encoded as 1), whereas a much larger number of examples pass the test and are labelled as negative (encoded as -1). This is a 1:14 proportion (6.64%), which is heavily imbalanced, so we will have to use upsampling techniques so that the models aren't biased towards the pass class.

A considerable number of features still have a relatively low variance (number of distinct values ~= number of values). Many features have Gaussian distributions, while a few have non-standard distributions.

A few features seem to have a strong relationship with the target.

After feature selection by removing the highly correlated variables, we do not have features that are highly correlated amongst each other anymore.

4. Data pre-processing:

    • Segregate predictors vs target attributes

    • Check for target balancing and fix it if found imbalanced.

    • Perform train-test split and standardise the data or vice versa if required.

    • Check if the train and test data have similar statistical characteristics when compared with original data.

The train and test distributions have shifted a bit, judging by the shift in means and medians for various features due to the SMOTE up-sampling. But they do seem to follow the same distribution as the underlying feature set, making them suitable for model training.
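The split-then-balance-then-standardise pipeline can be sketched as below. The report used SMOTE (from the imbalanced-learn package); this sketch substitutes plain random oversampling via `sklearn.utils.resample` to stay within scikit-learn, and the toy data is hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Toy imbalanced data standing in for the SECOM matrix (minority class = 1)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([-1] * 90 + [1] * 10)

# split first, then balance ONLY the training set so the test set stays honest
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# simple random oversampling of the minority class
# (the report used SMOTE, which synthesises new minority points instead)
X_min, y_min = X_tr[y_tr == 1], y_tr[y_tr == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=int((y_tr == -1).sum()), random_state=0)
X_bal = np.vstack([X_tr[y_tr == -1], X_up])
y_bal = np.concatenate([y_tr[y_tr == -1], y_up])

# standardise using statistics fitted on the training data only
scaler = StandardScaler().fit(X_bal)
X_bal_s, X_te_s = scaler.transform(X_bal), scaler.transform(X_te)
print((y_bal == 1).sum(), (y_bal == -1).sum())  # equal class counts
```

Balancing after the split (and fitting the scaler on training data only) is what prevents information from the test set leaking into training.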

5. Model training, testing and tuning:

    • Model training:

        ○ Pick up a supervised learning model.

        ○ Train the model.

        ○ Use cross validation techniques.

          (Hint: Use all CV techniques that you have learnt in the course.)

        ○ Apply hyper-parameter tuning techniques to get the best accuracy.

          (Suggestion: Use all possible hyper parameter combinations to extract the best accuracies.)

        ○ Use any other technique/method which can enhance the model performance.

          (Hint: Dimensionality reduction, attribute removal, standardisation/normalisation, target balancing etc.)

        ○ Display and explain the classification report in detail.

        ○ Design a method of your own to check whether a different sample population would lead to different train and test accuracies.

          (Hint: You can use your concepts learnt under Applied Statistics module.)

        ○ Apply the above steps for all possible models that you have learnt so far.

    • Display and compare all the models designed with their train and test accuracies.

    • Select the final best trained model along with your detailed comments for selecting this model.

    • Pickle the selected model for future use.

    • Import the future data file. Use the same to perform the prediction using the best chosen model from above. Display the prediction results.
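The pickle-and-reuse step can be sketched as follows; the tiny logistic-regression model and the file name `champion_model.pkl` are hypothetical stand-ins for the real champion model and path:

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the chosen champion model
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.9]])
y = np.array([-1, 1, 1, -1])
model = LogisticRegression().fit(X, y)

# pickle the trained model for future use ...
with open("champion_model.pkl", "wb") as f:
    pickle.dump(model, f)

# ... and later reload it to score a "future" data file
with open("champion_model.pkl", "rb") as f:
    reloaded = pickle.load(f)
preds = reloaded.predict(np.array([[0.95, 0.05]]))
print(preds)  # [1]
```

In practice the future data would pass through the same cleaning, imputation, and scaling steps as the training data before prediction.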

We can see that the accuracies have changed, and the F1 metric has also changed, indicating that the sample and its distribution have a significant effect on model training. Hence, we have to be careful while sampling not to introduce bias or human error before training, so that the models are able to generalize well to the population.

Feature '33' seems to be the strongest predictor for the pass/fail target, followed by weaker predictors 126, 511, 429 and other weak predictors.

This is a binary classification problem, where the machine learning model will try to predict whether each row is -1 or 1. The majority class is -1, which occurs in 93.36% of the observations. Since we care about the positive (minority) class, and AUC is independent of the chosen threshold and is not affected by class imbalance, AUC is chosen as the single metric to judge the models.

Hence, Random Forest and KNN classifiers are chosen to train a meta-model so that the results are more generalizable to unseen data. We also pick the models obtained via random search, because RandomizedSearchCV is often better than GridSearchCV at finding optimal hyperparameters: rather than searching a predefined grid, it samples randomly from the hyperparameter distributions in the search space.
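The distribution-based search can be sketched like this, on synthetic data (the parameter ranges and `n_iter` are illustrative, not the report's actual settings):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# hyperparameters are drawn from distributions, not a fixed grid
param_dist = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(2, 10),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5, cv=3, scoring="roc_auc", random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Each of the `n_iter` candidates draws one value from every distribution, so the budget is fixed regardless of how fine-grained the ranges are.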

A meta-classifier is simply a classifier that makes the final prediction from the predictions of the base classifiers, using those predictions as features. It takes the classes predicted by the various classifiers and picks the final result.
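In scikit-learn this is `StackingClassifier`; a minimal sketch on synthetic data, with Random Forest and KNN as the base learners as in the report (the logistic-regression meta-learner is scikit-learn's default choice, assumed here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# base learners produce out-of-fold predictions that become the
# meta-learner's input features
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,
)
stack.fit(X, y)
print(round(stack.score(X, y), 3))
```

Using cross-validated (out-of-fold) base predictions is what keeps the meta-learner from simply memorising the base models' training-set outputs.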


The Random Forest classifier has the best AUC score, but we select the stacking classifier, since it draws on multiple algorithms and should generalize better. So the chosen champion model for this dataset is the meta-classifier trained on the outputs of multiple well-performing models. The AUC of the stacking classifier is 0.994 and its accuracy is 98% on the validation set.

Predict on the test set to validate and test generalizability.

6. Conclusion and improvisation:

We have chosen the stacking classifier model with AUC 0.994 as the final model. To improve on these results, we could apply other feature-elimination techniques, such as forward selection or backward elimination, or use PCA to reduce dimensionality. A deep-learning-based model could learn complex patterns among many features more effectively. We could also train an anomaly-detection algorithm, which works well on imbalanced datasets (as used to detect spam, fraud, failures, etc.). Finally, we would need more data relative to the number of features for the trained model to perform well on unseen data.